Shuffle less for SSE4.1 #137
Conversation
```cpp
pack8(7, 6, 5, 4, 3, 2, 1, 0)};
// We will read from bswap at offsets for fixed format output. Accommodate
// the necessary offset calculations by using char*.
// This is aligned to 64 bits, so we won't cross a cache line when reading,
```
Good eyes! Will correct
The second one, which reads the data immediately written before. This dominated the profile. I will commit a simplification of the evaluation of the new stuff in `float_layout` tonight. It's probably not surprising that the AI's version was a bit too roundabout. I will then split this into, probably, three patches that can be benchmarked in sequence: 1. Move writing
Remove `shift_extra`, make the calculations more understandable.
Staged too early
@vitaut so I made the mistake of benchmarking not only with gcc but also with clang while reworking the patch: the whole thing is a bit of a mirage. With clang this is actually a pessimization, and the speed-up with gcc only brings it up to par with the pessimized clang. So I think this patch is misguided, and the apparent gain is actually due to gcc being led away from some questionable choices it makes in the original version. OTOH, thinking about this, it became clear that I should just do what I had considered but dismissed: put together the string and insert the decimal point in the middle while still inside the SSE register, and only do a single store to memory (plus `extra_digit` and `last_digit`). This seems to work well, and I hope to have it ready for submission later this week.
Thanks for the detailed analysis. This sounds like a good plan and I look forward to the updated version.
In the current code, we shuffle to reverse the byte order and we shuffle to shift the bits. This change combines the two. It also replaces a slow `memmove` for the decimal digits with another shuffle + store, so that the data is never re-read from memory.

This is my first AI-assisted patch to zmij. The AI's main task was to figure out the special cases when moving the writing of the last digit behind the shuffles. Previously, `last_digit` was only sometimes part of the `memmove` and could thus end up in a different position. The AI suggested keeping a table of where it goes, and so I did that. A table-free approach made the compiler add a branch and lose time.

The way I avoided increasing the table size for the various bswap + shift variations, and thus messing up alignments or introducing padding, is quite simple: what we want to do is reverse the order and drop the first N digits, where N depends on whether it's the integral part or the decimals, and on the exponent. We don't care what is appended past the actual digits. So it doesn't actually matter what the last few entries in the shuffle control are: we just keep the same `bswap` shuffle table, read it at an offset, and don't care what the entries at the end end up being. This only works because we reverse at the same time, so it cannot be carried over to the NEON code. The additional bytes are loaded from the same struct, so it's not uninitialized memory and ASAN is happy, but I'm not sure what the standard says about this; I don't think the SIMD memory access is in its scope. BTW, I changed the type of `bswap` to `char *` in order to get simpler index algebra and fewer casts.

Finally, in order to keep everything branch-free, in the case of `dec_exp < 0`, where no decimal point is inserted between the digits, the identical shuffle + store operation is simply repeated. This turned out to be measurably faster than conditionally jumping past it.

Note that the zmij/canada benchmark, given Canada's geography, doesn't contain any numbers < 1, so I created a separate benchmark (not included) that uses random fixed numbers. To my surprise, it was actually faster than the processing of the Canada data. A possible explanation is that it triggers the `last_digit` path far less often, but I didn't check this in detail.

On my Zen 4 this saves 5%; on my Tiger Lake it is performance neutral.
I also asked Claude to try the same thing for NEON, but none of the variations it tried came out faster than what is there right now.
PS: I just saw that I didn't clean this up fully. Hence the non-squashed three-patch series, with one unrelated change that I didn't mean to commit as part of this: I moved the `alignas(64)` annotation from the `static_data` declaration to the `struct data` declaration, the idea being that this way all the pointers that we use remember the alignment. It turns out the assembly with and without this change is the same on x64 (https://godbolt.org/z/Mxz1fbKjY) and ARM64 (https://godbolt.org/z/ecnPzTWhK), but I think it's cleaner, so I'm not going to remove it right now even though it is unrelated.